A Linear Least Squares Fit Mapping Method For Information Retrieval From Natural Language Texts
Authors
Abstract
This paper describes a unique method for mapping natural language texts to canonical terms that identify the contents of the texts. The method learns empirical associations between free-form texts and canonical terms from human-assigned matches, and determines a Linear Least Squares Fit (LLSF) mapping function which represents weighted connections between words in the texts and the canonical terms. The mapping function enables us to project an arbitrary text into the canonical term space, where the "transformed" text is compared with the terms and similarity scores are obtained which quantify the relevance between the text and the terms. This approach has superior power to discover synonyms or related terms and to preserve the context sensitivity of the mapping. We achieved a rate of 84% in both recall and precision on a testing set of 6,913 texts, outperforming other techniques including string matching (15%), morphological parsing (17%) and statistical weighting (21%).

1. Introduction

A common need in natural language information retrieval is to identify the information in free-form texts using a selected set of canonical terms, so that the texts can be retrieved by conventional database techniques using these terms as keywords. In medical classification, for example, original diagnoses written by physicians in patient records need to be classified into canonical disease categories which are specified for the purposes of research, quality improvement, or billing. We will use medical examples for discussion, although our method is not limited to medical applications. String matching is a straightforward solution to automatic mapping from texts to canonical terms. Here we use "term" to mean a canonical description of a concept, which is often a noun phrase.
Given a text (a "query") and a set of canonical terms, string matching counts the common words or phrases in the text and the terms, and chooses the term containing the largest overlap as most relevant. Although it is a simple and therefore widely used technique, a poor success rate (typically 15% to 20%) is observed [1]. String-matching-based methods suffer from the problems known as "too little" and "too many". As an example of the former, high blood pressure and hypertension are synonyms, but straightforward string matching cannot capture the equivalence in meaning because there is no common word in the two expressions. On the other hand, there are many terms which do share some words with the query high blood pressure, such as high head at term, fetal blood loss, etc.; these terms would be found by a string matcher although they are conceptually distant from the query.

Human-defined synonyms or terminology thesauri have been tried as a semantic solution for the "too little" problem [2] [3]. They may significantly improve the mapping if the right set of synonyms or the right thesaurus is available. However, as Salton pointed out [4], there is "no guarantee that a thesaurus tailored to a particular text collection can be usefully adapted to another collection. As a result, it has not been possible to obtain reliable improvements in retrieval effectiveness by using thesauruses with a variety of different document collections".

Salton has addressed the problem from a different angle, using statistics of word frequencies in a corpus to estimate word importance and reduce the "too many" irrelevant terms [5]. The idea is that "meaningful" words should count more in the mapping while unimportant words should count less. Although word counting is technically simple and this idea is commonly used in existing information retrieval systems, it inherits the basic weakness of surface string matching.
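Both baselines above can be sketched in a few lines. This is a minimal illustration, assuming a toy term list invented here (not the paper's actual collection) and an idf-style weight standing in for Salton's frequency statistics; it reproduces the "too little" and "too many" failures described in the text:

```python
from collections import Counter
from math import log

# Hypothetical toy data for illustration only.
terms = [
    "hypertension",
    "high head at term",
    "fetal blood loss",
]
query = "high blood pressure"

def tokens(text):
    return text.lower().split()

# Plain string matching: score a term by the number of shared words.
def overlap(query, term):
    return len(set(tokens(query)) & set(tokens(term)))

# Statistical weighting in the spirit of [5]: words that are rare in the
# term collection count more (an idf-style weight, an assumption here,
# not Salton's exact formula).
doc_freq = Counter(w for t in terms for w in set(tokens(t)))
def idf_overlap(query, term):
    shared = set(tokens(query)) & set(tokens(term))
    return sum(log(len(terms) / doc_freq[w]) for w in shared)

best = max(terms, key=lambda t: overlap(query, t))
# "hypertension" shares no word with the query ("too little"), while
# "high head at term" and "fetal blood loss" each share one word
# ("too many"), so plain matching picks a conceptually distant term.
```

Reweighting the words changes how much each shared word counts, but a word that never occurs in the term collection still contributes nothing, which is the weakness the following paragraph describes.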
That is, words used in queries but not occurring in the term collection have no effect on the mapping, even if they are synonyms of important concepts in the term collection. Besides, these word weights are determined regardless of the contexts where the words have been used, so the lack of sensitivity to context is another weakness.

We focus our efforts on an algorithmic solution for achieving the functionality of terminology thesauri and semantic weights without requiring human effort in identifying synonyms. We seek to capture such knowledge through samples representing its usage in various contexts, e.g. diagnosis texts with expert-assigned canonical terms collected from the Mayo Clinic patient record archive.

[Figure 1: the matrix representation of a text/term pair collection and the mapping function W computed from the collection, showing (a) text/term pairs encoded as matrices A and B, and (b) an LLSF solution W of the linear system WA = B.]

We propose a numerical method, a "Linear Least Squares Fit" mapping model, which enables us to obtain mapping functions based on the large collection of known matches and then use these functions to determine the relevant canonical terms for an arbitrary text.

(Proc. of COLING-92, Nantes, Aug. 23-28, 1992, p. 447)
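The LLSF idea can be sketched with an off-the-shelf least-squares solver. This is a minimal sketch under assumptions: the vocabularies and the two training pairs are invented for illustration (the real system learns from a large collection of expert-assigned matches), with columns of A holding word counts of the texts and columns of B holding their matched terms, as in Figure 1:

```python
import numpy as np

# Toy vocabularies (illustrative, not the paper's data).
text_words = ["high", "blood", "pressure", "hypertension"]
term_words = ["hypertension"]

# Two training pairs: "high blood pressure" -> hypertension, and
# "hypertension" -> hypertension. Each column of A is one training text
# (word counts over text_words); the same column of B is its term.
A = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [0, 1]], dtype=float)
B = np.array([[1, 1]], dtype=float)

# Solve min ||W A - B|| for W, via the transposed system A^T W^T = B^T.
W = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T

# Map an unseen query "high blood pressure": it shares no word with the
# canonical term, yet W projects it onto "hypertension".
query = np.array([[1.0], [1.0], [1.0], [0.0]])
score = (W @ query)[0, 0]
```

The point of the toy example is the synonym effect: the query scores fully against hypertension although string overlap between them is zero, because the training pairs taught W that high, blood and pressure together predict that term.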
Similar Papers
Automated Indexing of Mammography Reports Using Linear Least Squares Fit
Radiologists routinely document mammography results in free-text dictations. In the last decade, there has been an increase in the volume of mammography performed in the U.S. As a result, the American College of Radiology has standardized the practice of screening mammography by introducing a controlled vocabulary and practice standards tracked by audits. Extracting data from these free text re...
Cross-lingual Document Extraction Using Non-linear Semantic Maps (Extracción crosslingüe de documentos usando mapas semánticos no-lineales)
A non-linear semantic mapping procedure is proposed for cross-language document retrieval. The method relies on a non-linear space reduction technique for constructing semantic embeddings of multilingual document collections. In the proposed method, an independent embedding is constructed for each language in the multilingual collection and the similarities among the resulting semantic represen...
Engineering Application of Correlation on ANN Estimated Mass
A functional relationship between two variables, mass applied to a weighing platform and mass estimated using Multi-Layer Perceptron Artificial Neural Networks, is approximated by a linear function. Linear relationships and correlation rates are obtained which quantitatively verify that the Artificial Neural Network model is functioning satisfactorily. Estimated mass is achieved through recallin...
Cross-lingual Information Retrieval Model Based on Bilingual Topic Correlation
How to construct relationships between bilingual texts is important for effectively processing multilingual text data and crossing language barriers. Corpus-based cross-lingual latent semantic indexing (CL-LSI) does not fully take into account the bilingual semantic relationship. The paper proposes a new model building the semantic relationship of bilingual parallel documents via partial least squares (PLS)....
Utility Estimation of Health Status of Cancer Patients by Mapping for Cost-Utility Analysis
Background: It is important to obtain accurate information about people's preferences for measuring quality-adjusted life years (QALYs), because it is necessary for cost-utility analysis. In this regard, mapping is a method to access this information. Therefore, the purpose of this study was to map the Functional Assessment of Cancer Therapy – General (FACT-G) onto the Short Form Six Dimension (SF...